out-of-order packet reception in a SAN includes requestor
and responder nodes, coupled by a plurality of paths, that
maintain the good and bad status of each path and also
SYSTEM AND METHOD FOR IMPLEMENTING ERROR DETECTION AND RECOVERY IN A SYSTEM AREA NETWORK

CROSS-REFERENCES TO RELATED APPLICATIONS

This application claims priority from Provisional App. No. 60/070,650, filed Jan. 7, 1998, which is incorporated herein by reference.

BACKGROUND OF THE INVENTION

Traditional network systems utilize either channel semantics (send/receive) or memory semantics (DMA) model. Channel semantics tend to be used in I/O environments and memory semantics tend to be used in processor environments.

In a channel semantics model, the sender does not know where data is to be stored, it just puts the data on the channel. On the sending side, the sending process specifies the memory regions that contain the data to be sent. On the receiving side, the receiving process specifies the memory regions where the data will be stored.

In the memory semantics model, the sender directs the data to a particular location in the memory, utilizing remote direct memory access (RDMA) transactions. The initiator of the data transfer specifies both the source buffer and destination buffer of the data transfer. There are two types of RDMA operations, read and write.

The virtual interface architecture (VIA) has been jointly developed by a number of computer and software companies. VIA provides consumer processes with a protected, directly accessible interface to network hardware, termed a virtual interface. VIA is especially designed to provide low latency message communication over a system area network (SAN) to facilitate multi-processing utilizing clusters of processors.

A SAN is used to interconnect nodes within a distributed computer system, such as a cluster. The SAN is a type of network that provides high bandwidth, low latency communication with a very low error rate. SANs often utilize fault-tolerant capability.

It is important for the SAN to provide high reliability and high-bandwidth, low latency communication to fulfill the goals of the VIA. Further, it is important for the SAN to be able to recover from errors and continue to operate in the event of equipment failures. Error recovery must be accomplished without high CPU overhead associated with all transactions. Furthermore, error recovery should not increase the complexity for the consumer of VIA services.

SUMMARY OF THE INVENTION

According to one aspect of the present invention, a SAN maintains local copies of a sequence number for each data transfer transaction at the requester and responder nodes. Each data transfer is implemented by the SAN as a sequence of request/response packet pairs. An error condition arises if a response to any request packet is not received at the requesting node. The responder and requestor nodes are coupled by a plurality of paths and each node maintains a record of the good or bad status of each path. If a transaction fails and the path is permanently bad, both nodes update their status to indicate that the path is bad. This is to prevent further transactions from including any stale requests that are potentially still in the network from arriving at the destination and potentially corrupting data.

According to another aspect of the invention, if an error occurs on a path the requestor node implements a barrier transaction on the path to determine if the failure is permanent or transient.

According to another aspect of the invention, the barrier transaction is performed by writing a number chosen from a large number space in a way that minimizes the probability of reusing the number in a period of time.

According to one aspect of the invention, the number is randomly chosen from a large number space.

According to another aspect of the invention, the large number is based on the requestor ID and an incrementing component managed by the requestor.

According to another aspect of the invention, if the failure is transient the requestor retransmits packets starting with the packet that first caused an error condition to be detected.

According to another aspect of the invention, a sequence number is included in each request packet and copied into each response packet. A local copy of the sequence number is maintained at the requestor and responder nodes. If the sequence number in the request packet does not match the sequence number at the responder, a negative acknowledge response packet is generated.

Other features and advantages of the invention will be apparent in view of the following detailed description and appended drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram depicting ServerNet protocol layers implemented by hardware, where ServerNet is a SAN manufactured by the assignee of the present invention;

FIGS. 2 and 3 are block diagrams depicting SAN topologies;

FIG. 4 is a schematic diagram depicting logical paths between end nodes of a SAN;

FIG. 5 is a schematic diagram depicting routers and links connecting SAN end nodes;

FIG. 6 is a graph depicting the transmission of request and response packets between a requestor and a responder end node. FIG. 6 shows the sequence numbers used in packets for three Send operations, an RDMA operation, and two additional Send operations. The diagram shows the sequence numbers maintained in the requestor logic, the sequence number contained in each packet, and the sequence numbers maintained at the responder logic;

FIG. 7 is two interlocked state diagrams showing the state that software on the requestor and responder moves through for each path;

FIG. 8 is a graph depicting retransmission during error recovery due to a lost request packet; and

FIG. 9 is a graph depicting retransmission during error recovery due to a lost acknowledgment packet.

The preferred embodiments will be described implemented in the ServerNet II (ServerNet) architecture, manufactured by the assignee of the present invention, which is a layered transport protocol for a System Area Network (SAN) optimized to support the Virtual Interface (VI) architecture session layer which has stringent user-space to user-space latency and bandwidth requirements. These requirements mandate a reliable hardware (HW) message transport solution with minimal software (SW) protocol stack overhead. The ServerNet II protocol layers for an end
node VI Network Interface controller/Card (NIC) and for a routing node are illustrated in FIG. 1. A single NIC and VI session layer may support one or two ports, each with its associated transaction, packet, link-level, MAC (media access) and physical layer. Similarly, routing nodes with a common routing layer may support multiple ports, each with its associated link-level, MAC and physical layer.

Support for two ports enables the ServerNet II SAN to be configured in both non-redundant and redundant (fault tolerant, or FT) SAN configurations as illustrated in FIG. 2 and FIG. 3. On a fault tolerant network, a port of each end node may be connected to each network to provide continued VI message communication in the event of failure of one of the SANs. In the fault tolerant SAN, nodes may be ported into a single fabric or single ported end nodes may be grouped into pairs to provide duplex FT controllers. The fabric is the collection of routers, switches, connectors, and cables that connect the nodes in a network.

The following describes general ServerNet II terminology and concepts. The use of the term "layer" in the following description is intended to describe functionality and does not imply gate level partitioning.

Two ports are supported on a NIC for both performance and fault tolerance reasons. Both of these ports operate under the same session layer VI engine. That is, data may arrive on any port and be destined for any VI. Similarly, the VIs on the end node can generate data for any of these ports.

ServerNet II packets are comprised of a series of data symbols followed by a packet framing command. Other commands, used for flow control, virtual channel support, and other link management functions, may be embedded within a packet. Each request or response packet defines a variety of information for routing, transaction type, verification, length and VI specific information.

i. Routing in the ServerNet II SAN is destination-based using the first 3 bytes of the packet. Each NIC end node port in the network is uniquely defined by a 20 bit Port SNID (ServerNet Node ID). The first 3 bytes of a packet contain the Destination port's SNID or DID (destination port ID) field, a three bit Adaptive Control Bits (ACB) field and the fabric ID bit. The ACB is used to specify the path (deterministic or link-set adaptive) used to route the packet to its destination port as described in the following section.

ii. The transaction type fields define the type of session layer operation that this ServerNet II packet is carrying and other information such as whether it is a request or a response and, if a response, whether it is an ACK (acknowledgment) or a NACK (negative acknowledgment). The ServerNet II SAN also supports other transaction types.

iii. Transaction verification fields include the source port ID (SID) and a Transaction Serial Number. The transaction serial number enables a port with multiple requests outstanding to uniquely match responses to requests.

iv. The Length field consists of an encoding of the number of bytes of payload data in the packet. Payloads up to 512 bytes are supported and code space is reserved for future increases in payload size.

v. The VI Session Layer specific fields describe VI information such as the VI Operation, the VIA Sequence number and the Virtual Interface ID number. The VI Operation field defines the type of VI transaction being sent (Send, RDMA Read, RDMA Write) and other control information such as whether the packet is ordered or unordered, whether there is immediate data and/or whether this is the first or last packet in a session layer multi-packet transfer. Based on the VI transaction type and control information, a 32 bit Immediate data field or a 64 bit Virtual address may follow the VI ID number.

vi. The payload data field carries up to 512 bytes of data between requesters and responders and may contain a pad byte.

vii. The CRC field contains a checksum computed over the entire packet.

Transaction Overview

The basic flow of transactions through the ServerNet II SAN will now be described. VI requires the support of Send, RDMA Read, and RDMA Write transactions. These are translated by the VI session layer into a set of ServerNet II transactions (request/response packet pairs). All data transfers (e.g., reading a disk file to CPU memory, dumping large volumes of data from a disk farm directly over a high-speed communications link, one end node simply interrupting another) consist of one or more such transactions.

Creating a Request Packet

Routines are provided for Send, RDMA Write, and RDMA Read operations. These routines place a descriptor for the desired transfer in the appropriate VI queue and notify the VIA hardware that the descriptor is ready for processing. The VIA hardware reads the descriptor, builds the ServerNet II request packet header, and assembles the data payload (if appropriate).

Dual Ports and Ordering

In a NIC with two ports, it is possible for a single VIA interface to process Sends and RDMA operations from several different VIs in parallel. It is also possible for a large RDMA transfer from a single VI to be transferred on both of the ports simultaneously. This latter feature is called Multipathing.

ServerNet II end nodes can connect both their ports to a single network fabric so that there are up to four possible paths between ServerNet II end nodes. Each port of a single end node may have a unique ServerNet ID (SNID). FIG. 4 depicts the four possible paths that End node A can use when sending a request to End node B:

1) End node A SNID[0] to End node B SNID[0]
2) End node A SNID[0] to End node B SNID[1]
3) End node A SNID[1] to End node B SNID[0]
4) End node A SNID[1] to End node B SNID[1]

FIG. 5 depicts a network topology utilizing routers and links. In FIG. 5, end nodes A-F, each having first and second send/receive ports 0 and 1, are coupled by a ServerNet topology including routers R1-R5. Links are represented by lines coupling ports to routers or routers to routers. A first adaptive set (fat pipe) 2 couples routers R1 and R3 and a second adaptive set (fat pipe) 4 couples routers R2 and R4.

Routing may be deterministic or link set adaptive. An adaptive link-set is a set of links (also called lanes) between two routers that have been grouped to provide higher bandwidth. The Adaptive Control Bits (ACB) specify which type of routing is in effect for a particular packet.

Deterministic routing preserves strict ordering for packets sent from a particular source port to a destination port. In deterministic routing, the ACB field selects a single path or
lane through an adaptive link-set. Send transactions for a particular VI require strict ordering and therefore use deterministic routing.

RDMA transactions, on the other hand, may make use of all possible paths in the network without regard for the ordering of packets within the transaction. These transactions may use link-set adaptive routing as described below. The ACB field specifies which specific link (or lane) in this link-set is to be used for deterministic routing. Alternatively, the ACB field can specify link-set adaptivity, which enables the packets to dynamically choose from any of the links in the link-set.

A sample topology with several different examples of multipathing using link and path adaptivity is shown in FIG. 5.

Multipathing allows large block transfers done with RDMA Read or Write operations to simultaneously use both ports as well as adaptive links between the two communicating NICs. Since the data transfer characteristics of any one VI are expected to be bursty, multipathing allows the end node to marshal all its resources for a single transfer. Note that multipathing does not increase the throughput of multiple Send operations from one VI. Sends from one VI must be sent strictly ordered. Since there are no ordering guarantees between packets originating from different ports on a NIC, only one port may be used per Send. Furthermore, only a single ordered path through the Network may be used, as described in the following.

Transaction and Packet Layers

The transaction layer builds the ServerNet II request packet by filling in the appropriate SID, Transaction Serial Number (TSN), and CRC. The SID assigned to a packet always corresponds to the SNID of the port the packet originates from. The TSN can be used to help the port manage multiple outstanding requests and match the resulting responses uniquely to the appropriate request. The CRC enables the data integrity of the packet to be checked at the end node and by routers en route.

Following the ServerNet II link protocol, the packet is encoded in a series of data symbols followed by a status command. The ServerNet II link layer uses other commands for flow control and link management. These commands may be inserted anywhere in the link data stream, including between consecutive data symbols of a packet. Finally, the symbols are passed through the MAC layer for transmission on the physical media to an intermediate routing node.

Routing

The routing control function is programmable so that the packet routing can be changed as needed when the network configuration changes (e.g., route to new end nodes). Router nodes serve as crossbar switches; a packet on any incoming (receive) side of a link can be switched to the outgoing (transmit) side of any link. As the incoming request packet arrives at a router node, the first three bytes, containing the DID and ACB fields, are decoded and used to select a link leading to the destination node. If the transmit side of the selected link is not busy, the head of the packet is sent to the destination node whether or not the tail of the packet has arrived at the routing node. If the selected link is busy with another packet, the newly arrived packet must wait for the target port to become free before it can pass through the crossbar.

As the tail of the packet arrives, the router node checks the packet CRC and updates the packet status (good or bad). The packet status is carried by a link symbol TPG (this packet good) or TPB (this packet bad) appended at the end of the packet. Since packet status is checked on each link, a packet status transition (good to bad) can be attributed to a specific link. The packet routing process described above is repeated for each router node in the selected path to the destination node.

Receiving a Request Packet

When the packet arrives at the destination end node, the ServerNet II interface receiver checks its validity (e.g., the packet must carry the correct destination node ID, the Fabric bit in the packet matches the Fabric bit associated with the receiving port, the request field encodes a valid request, and the CRC must be good). If the packet is invalid for any reason, the packet is discarded. The ServerNet II interface may save error status for evaluation by software. If the validity checks succeed, several more checks are made. If the request is an RDMA Read or Write, the address is checked to ensure access has been enabled for that particular VI. Also, the input port and Source ID of the packet are checked to ensure access to the particular VI is allowed on that input port from the particular source. If the packet is valid, the request can be completed.

Response Packet

A response is created based on the success (ACK response) or failure (NACK response) of the request packet. A successful read request, for example, would include the read data in the ACK response. The source node ID from the request packet is used as the destination node ID for the response packet. The response packet must be returned to the original source port. The path taken by the response is not necessarily the reverse of the path taken by the request. The network may be configured so that responses take very different paths than requests. If strict ordering is not required, the response, like the request, may use link-set adaptivity. The response packet is routed back to the SNID specified by the SID field of the request. The ACB field of the request packet is also duplicated for the response packet.

The response can be matched with the request using the TSN and the packet validity checks. If an ACK response passes these tests, the transaction layer passes the response data to the session layer, frees resources associated with the request, and reports the transaction as complete. If a NACK response passes these tests, the end node reports the failure of the transaction to the session layer. If a valid ACK/NACK response is not received within the allotted time limit, a time-out error is reported.

The requestor can stream many strictly ordered ServerNet II messages onto the wire before receiving an acknowledgment. The sliding window protocol allows the requestor to have up to 128 packets outstanding per VI.

The hardware can operate in one of two modes with respect to generating multiple outstanding request packets:

1. The hardware can stream packets from the same VI send queue onto the wire, and start the next descriptor before receiving all the acknowledgments from the current descriptor. This is referred to as "Next Descriptor After Launch" or NDAL.

2. The hardware can stream packets from a single descriptor onto the wire, but wait for the outstanding acknowledgments to complete before starting the next descriptor. This is referred to as "Next Descriptor after ACK" or NDAA.
The choice of NDAL or NDAA modes of operation is determined by how strongly ordered the packets are generated.

Ordered and unordered messages may be mixed on a single VI. When generating an unordered message, the requester must wait for completion of all acknowledgments to unordered packets before starting the next descriptor.

Ordering of Send Packets Presented to Transaction Layer

The VI architecture has no explicit ordering rules as to how the packets that make up a single descriptor are ordered among themselves. That is, VIA only guarantees the message ordering the client will see. For example, VIA requires that Send descriptors for a particular VI be completed in order, but the VIA specification does not say how the packets will proceed on the wire.

The ServerNet II SAN requires that all Send packets destined for a particular VI be delivered by the SAN in strict order. As long as deterministic routing is used, the network assures strict ordering along a path from a particular source node to a particular destination node. This is necessary because the receiving node places the incoming packets into a scatter list. Each incoming packet goes to a destination determined by the sum total of bytes of the previous packets. The strict ordering of packets is necessary to preserve integrity of the entire block of data being transferred because incoming packets are placed in consecutive locations within the block of data. Each packet has a sequence number to allow the receiver to detect an out of order, missing, or repeated packet.

There are two ways for an end node to meet these ordering requirements:

a. The end node can wait for the acknowledgment from each Send packet to complete before starting another Send packet for that VI. By waiting for each acknowledgment, the end node does not have to worry about the network providing strict ordering and can choose an arbitrary source port, adaptive link set and destination port for each message.

b. The end node can restrict all the Send operations for a given VI to use the same source port, the same destination port, and a single adaptive path. By choosing only one path through the network, the end node is guaranteed that each Send packet it launches into the network will arrive at the destination in order.

The second approach requires the VIA end node to maintain state per VI that indicates which source port, destination port and adaptive path is currently in use for that particular VI. Furthermore, the second approach allows the hardware to process descriptors in the higher performance NDAL mode.

With the second approach, Send packets from a single VI can stream onto the network without waiting for their accompanying acknowledgments. An incrementing sequence number is used so the destination node can detect missing, repeated, or unordered Send packets.

Ordering of RDMA Packets

RDMA operations have slightly different ordering requirements than Send operations. An RDMA packet contains the address to which the destination end node writes the packet contents. This allows multiple RDMA packets within an RDMA message to complete out of order. The contents of each packet are written to the correct place in the end node's memory, regardless of the order in which they complete.

RDMA request packets may be sent ordered or unordered. A bit in the packet header is set to 1 for ordered packets and is set to 0 for unordered packets. As will be explained later, this bit is used by the responder logic to determine if it should increment its copy of the expected sequence number. Sequence numbers do not increment for unordered packets.

The end node is free to use different source ports, destination ports and adaptive paths for the packets. This freedom can be exploited for a performance gain through multipathing: simultaneously sending the RDMA packets of a single message across multiple paths.

When RDMA Read or Write packets are sent over a path that does not exhibit strict ordering with the Send packets from the same VI, care must be taken when launching packets for the following message. The next message cannot be started until the last acknowledgment of the RDMA Read or Write operation successfully completes.

In other words, when multipathing is used to generate RDMA Read or Write requests, the hardware must operate in the NDAA mode. This ensures the RDMA Read or Write is completed before moving on to subsequent descriptors.

An end node may choose to send RDMA packets strictly ordered. This can be advantageous for smaller RDMA transfers as the hardware can operate in NDAL mode. The hardware can proceed to the next descriptor immediately after launching the last packet of a message that is sent strictly ordered (and hence used incrementing sequence numbers).

Ordering of Generated Response Packets at the Responder

The ServerNet II end node must respond to incoming Send requests and RDMA Write requests from a particular VI in strict order, and must write these packets to memory in strict order.

The ServerNet II end node must also respond to incoming RDMA Read requests from a particular VI in strict order.

Because response packets are transported by the network in strict order, the requestor will receive all incoming response packets for a particular VI in the same order as that in which the corresponding requests were generated.

Message Sequence Numbers

The ServerNet SAN uses acknowledgment packets to inform the requestor that a packet completed successfully. Sequence numbers in the packets (and acknowledgments) are used to allow the sender to support multiple outstanding requests to ensure adequate performance and to be able to recover from errors occurring in the network.

FIG. 6 is a graph depicting the generation, checking, and updating of VIA sequence numbers at requestor and responder nodes. In FIG. 6, time increases in the downward direction. Requests are indicated by solid arrows directed to the right and responses by dotted arrows directed to the left.

Sequence Number Initialization

The requestor and responder logic each maintain an 8 bit sequence number for each VI in use. When the VI is created, the requester on one node and the responder on the remote node initialize their sequence numbers to a common value, zero in the preferred embodiment.

After this, the requester places its sequence number into each of the outgoing request packets. As depicted in FIG. 6, the sequence number, SEQ, is included in each request packet. The responder compares the sequence number from the incoming request packet with the responder's local copy.
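
The per-VI sequence number handling described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the hardware logic: the class names `Requestor`, `Responder`, and `RequestPacket` are hypothetical, and wrapping the 8 bit counter modulo 256 is an assumption that the text does not spell out. The initial value of zero, the ordered bit, and the rule that sequence numbers do not increment for unordered packets come from the text.

```python
from dataclasses import dataclass

SEQ_BITS = 8                  # from the text: 8 bit sequence number per VI
SEQ_MODULUS = 1 << SEQ_BITS   # assumption: the counter wraps modulo 256


@dataclass
class RequestPacket:
    seq: int       # SEQ field copied from the requestor's local counter
    ordered: bool  # header bit: 1 (True) for ordered, 0 (False) for unordered


class Requestor:
    def __init__(self) -> None:
        self.seq = 0  # both ends start at zero when the VI is created

    def build_request(self, ordered: bool) -> RequestPacket:
        """Stamp the packet with the local SEQ; advance only for ordered packets."""
        packet = RequestPacket(seq=self.seq, ordered=ordered)
        if ordered:
            self.seq = (self.seq + 1) % SEQ_MODULUS
        return packet


class Responder:
    def __init__(self) -> None:
        self.expected_seq = 0  # same common initial value as the requestor

    def accept(self, packet: RequestPacket) -> bool:
        """Return True if SEQ matches the local copy; advance only if ordered."""
        if packet.seq != self.expected_seq:
            return False
        if packet.ordered:
            self.expected_seq = (self.expected_seq + 1) % SEQ_MODULUS
        return True
```

Sending three ordered packets leaves both counters at 3; a following unordered packet carries SEQ 3 and leaves both counters unchanged, so the next ordered packet is still recognised as the expected one.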
The responder uses this comparison to determine if the packet is valid, if it is a duplicate of a packet already received, or if it is an out-of-sequence packet. An out-of-sequence packet can only happen if the responder missed an incoming packet. The responder can choose to return a 'sequence error NACK packet' or it can simply ignore the out-of-sequence packet. In the latter case, the requester will have a time-out on the request (and presumably on the packet the responder missed) and initiate error recovery. Generating a sequence error NACK packet is preferred as it forces the requester to start error recovery more quickly.

The following describes how the sequence numbers are generated and checked.

Generating Sequence Numbers for Request Packets

When transmitting ordered packets (i.e. transfers are on a specific source port to a specific destination port and the ACB specifies a specific lane) the request sequence number is incremented after each packet is sent. When transmitting unordered packets (i.e. multipathing is used and/or the ACB bits specify full link set adaptivity) the request sequence number is not incremented after such a packet is sent.

For example, in FIG. 6, during the first two Send transactions, the local copy of the request sequence number is incremented after the packet is sent (Rqst. SN=0 to 6). For the RDMA operation, which sends 2500 bytes unordered, the requester does not increment the local copy of the request sequence number (Rqst. SN=6). The requester does not increment the local copy of the SN until after the first packet of the Send following the RDMA is transmitted.

Send packets are typically sent fully ordered lest the requestor have to wait for an acknowledgment for each packet before proceeding to the next. On the other hand, RDMA packets may be sent either ordered or unordered. To take advantage of multipathing, a requestor must use unordered RDMA packets.

The sender guarantees to never exceed the windowsize number of packets outstanding per VI. If S is the number of bits in the sequence number, then the windowsize is 2**(S-1).

A packet is outstanding until it and all its predecessors are acknowledged. The requestor does not mark a descriptor done until all packets requested by that descriptor are positively acknowledged.

Checking Sequence Numbers on Incoming Request Packets

The destination node responding to the incoming request packet checks each incoming request packet to verify its sequence number against the responder's local copy.

The responder logic compares its sequence number with the packet's sequence number to determine if the incoming packet is either:

the expected packet it is looking for (i.e., the packet's sequence number is the same as the sequence number maintained by the responder logic), in which case the responder processes the packet and if all other checks are passed, the packet is Acknowledged and committed

packet and throws it away. The receive logic in the VI is not stopped and the responder does not increment its sequence number; or

a duplicate packet (which is being resent because the requestor must not have received an earlier ACK) in which case the responder ACKs the packet and throws it away. If the request had been an RDMA Read, the responder completes the read operation and returns the data with a positive acknowledgment.

An example of the responder checking sequence numbers for ordered and unordered packets is given in FIG. 6. In FIG. 6, during the first two Send transactions, the responder checks that the SEQ in the packet matches the local copy of Rsp. SN. Since the Send packets include an ACB indicating ordered packets, the Rsp. SN is incremented after each response packet is transmitted. At the end of the first two Send transactions, Rqst. SN and Rsp. SN both equal 6. The packets for the RDMA include an ACB indicating unordered receipt is allowed. Neither the requestor nor the responder increments its local copy of SN. Thus, at the end of the RDMA transaction both Rqst. SN and Rsp. SN=6. The first packet of the subsequent Send transaction has SEQ=6 and SEQ matches the local copy of Rsp. SN. Since Send packets are ordered, the responder increments its local copy of Rsp. SN.

Sequence Numbers on Response Packets

When generating either a positive or negative acknowledgment, the responder logic copies the incoming sequence number and uses it in the sequence number field of the acknowledgment.

The requestor logic matches incoming responses with the originating request by comparing the SourceID, VI number, Sequence number, transaction type, and Transaction Serial Number (TSN) with that of the originating request.

Error Recovery and Path State

Error recovery is initiated by the requesting node whenever the requestor fails to get a positive acknowledgment for each of its request packets. A time-out, or a NACK indicating a sequence number error, can cause the requestor's Kernel Agent to start error recovery.

Error recovery involves three basic steps:

1) Completing a barrier operation(s) to flush out any errant request or response packets.
2) Disabling a bad path if the barrier operation failed.
3) Retransmitting from the earliest packet that had failed.

The first two steps will now be described with reference to FIG. 7, which is two interlocked state diagrams showing the state that software on the requestor and responder moves through for each path. In FIG. 7, dashed lines represent Kernel to Kernel Supervisory Protocol messages that modify the remote node's state.

The ServerNet architecture allows multiple paths between end nodes. The requestor repeats these two basic steps on each path until the packet is transmitted successfully.

The requestor and responder SW each maintain a view of the state of each path. The requestor uses its view of the path

to memory. If the transaction is ordered, the responder state to determine which path it uses for Send and RDMA 

increments its sequence number. If the transaction is operations. The responder uses its view of the path state to 

unordered, the responder does not increment its determine which input paths it allows incoming requests on, 

sequence number; The responder logic maintains a four bit field (ReqlnPath 

an out-of-sequence packet (which means an earlier 65 Vector) for each VI in use. Each of the four bits corresponds 

incoming packet must have gotten lost), beyond the one to one of the four possible paths between the requestor's two 

it is looking for in which case the responder NACKs the ports and the responder's two ports. The requestor only 
accepts incoming requests from a particular source or
destination port if the corresponding bit in the ReqInPath Vector
is set.

The requestor and responder communicate using the
kernel-to-kernel Supervisory protocol to communicate path
state changes.

The requestor's view of the path state transitions from
good to bad whenever the requestor fails to get an
acknowledgment (either positive or negative) to a request. The
requestor detects the lack of an ACK or NACK by getting a
time-out error. The requestor can attempt a barrier operation
on the path to see if the failure is permanent or transient. If
the barrier succeeds, the path is considered good and the
original operation can be retried. If the barrier fails, the
requestor must resort to a different good path.

Before the requestor can try a different good path, the
requestor must inform the destination that the original path
is bad. This is done by any path possible. For example, in
VIA the Kernel Agent to Kernel Agent Supervisory Protocol
is used. After the destination is informed the path is bad, the
destination disables a bit in a four bit field (ReqInPath
Vector), thereby ignoring incoming requests from that path.
The requestor then stops using the bad path until a
subsequent barrier transaction determines that the path is good.
After the destination acknowledges the supervisory protocol
message, indicating that the destination has disabled
requests from the offending path, the requestor is free to
retry the message on a different path.

After a time-out error, the requestor attempts to bring the
path back to a useful state by completing a barrier operation.
The barrier operation ensures there are no other packets in
any buffer that might show up later and corrupt the data
transfer.

Barrier operations are used in error recovery to flush any
stale request or stale response packets from a particular path
in the SAN. A path is the collection of ServerNet links
between a specific port of two end nodes.

A VIA barrier operation is done with an RDMA Write
followed by an RDMA Read. A number chosen from a large
number space (either incrementing or pseudo random) is
written to a fixed location (e.g. a page number agreed to, a
priori, by the kernel agent-to-kernel agent Supervisory
Protocol and either a fixed or random offset within the page).
The number is then read back with an RDMA read. If the
read value matches the write value, then the barrier
succeeded and there are guaranteed to be no more Send or
RDMA request or response packets on that path between the
requestor and responder.

If the RDMA operation fails because the number read
back does not match the number written, then the barrier is
tried again. This could have happened because a previous
response in the network came back and fulfilled the barrier.

Note that the barrier needs to be done separately on all
paths the RDMA operation could have taken. That is, if the
RDMA operation was being generated from multiple source
ports (multipathing) and was using full link adaptivity (the
packets were allowed to take any one of four possible
"lanes"), then separate barrier operations must be done from
each source port to each destination port, over each of the
possible link adaptive paths.

The barrier operation must be done for each of the
possible "lanes" between a specific Source port and
Destination port. A barrier done on one VI ensures that all other
VIs using that source port and destination port have no
remaining request or response packets lurking in the SAN.

If a path traverses a "fat-pipe" a separate barrier must be
sent down each lane of the fat pipe. SW can either blindly
send four barrier operations (one for each lane) or it can
maintain state, telling how many lanes are in use on the fat
pipes between any given source/destination pair. The barrier
need only be sent along the same path as the original request
that failed. There is no requirement for the barrier to be sent
from or to the same VI.

Turning now to the third step, i.e., retransmitting from the
earliest packet that failed to receive a positive
acknowledgement. After notification of the error,
the requestor retransmits the packets starting at (or before) the
packet that failed to receive a positive acknowledgement.
The requestor can [illegible] WindowSize number of
packets. The responder acknowledges but discards
any packets that are resent if they have already been stored
in the receive queue. When the [illegible] packet is reached, the
responder logic can tell from the sequence number that it is
now time to resume storing data in the receive queue.

Examples of retransmission after failure to receive a
response are depicted in FIGS. 8 and 9. In FIG. 8, the request
packet with SEQ=2 is corrupted. The missing request is
detected by the responder on the next packet and NACKed
(Negative Acknowledged) and all subsequent packets are
thrown away and NACKed. The requestor resets its send
engine to start generating packets at the one that failed to
receive an ACK (in this case Rqst. SN=2). The responder
recognizes the SEQ=2 and accepts the packets.

In FIG. 9, the response packet with SEQ=1 is corrupted.
The missing response is detected when the requestor times
out its transaction. The requestor resets its send engine to
start generating packets at the one that failed to receive an
ACK (in this case Rqst. SN=1). The responder recognizes the
resent packets as already having been received,
acknowledges them and discards the data.

Note that the response packets for this particular RDMA
transaction all have the same value of SEQ because the
request SNs and response SNs are not incremented for
RDMA transactions that are unordered. In this case, the
TSNs are utilized by the requestor to match response packets
to outstanding requests.

Error recovery places several requirements on the
requestor's KA (Kernel Agent, the kernel mode driver code
responsible for SAN error recovery):

1) The KA must determine the sequence number to restart
with.

2) The KA must determine the proper data contents of the
packet to be resent.

3) In order for the KA to determine the appropriate
sequence number, it must be aware of how the hardware
packetizes data under any given combination of descriptors,
data segments, page crossings etc.

Note the responder side does not require KA involvement
(unless a barrier operation fails).

The invention has now been described with reference to
the preferred embodiments. Alternatives and substitutions
will now be apparent to persons of skill in the art. For
example, while the invention has been described in the
context of the ServerNet II SAN, the principles of the
invention are useful in any network that utilizes multiple
paths between end nodes. Accordingly, it is not intended to
limit the invention except as provided by the appended
claims.

What is claimed is:

1. A method for error detection and recovery in a system
with a plurality of networked nodes, including source and
destination, communicating with each other via paths,
comprising:
creating a request packet for a request transaction,

routing the request packet from the source to the desti- 
nation via a particular path; 

maintaining at each of the source and destination a status 
for each of its respective paths;

detecting a time-out error for failure within a predeter- 
mined time limit to receive at the source an acknowl- 
edge (ACK) packet or a negative-acknowledge 
(NACK) packet in response to the request packet, the 
time-out error created from a failure of the particular
path; 

performing a barrier transaction via the particular path to 
determine if the failure of the particular path is transient 
or permanent; 

periodically repeating the barrier transaction via the particular
path in order to determine if its failure is cured;

re-transmitting the request packet via the particular path if 
the failure is transient; and 

if the failure is permanent, 
updating at the source the status for the particular path,
and 

routing to the destination information about the 
updated status via an alternate path to prompt updat- 
ing at the destination of the status for the particular 
path, wherein a failed path is not used.

2. The method of claim 1, wherein when the transaction 
is ordered the routing is deterministic preserving strict 
ordering of request, ACK and NACK packets by selecting a 
single path as the particular path. 

3. A method as in claim 1, wherein when the transaction 

is not ordered the routing is link set adaptive enabling
dynamic change of request, ACK and NACK packet routing 
by selection of paths from a link set. 

4. A method as in claim 1, wherein the request packet is 
discarded if it is invalid. 

5. A method as in claim 1, wherein the request packet
contains a sequence number to indicate its place in a 
sequence of request packets and adaptive control bits to 
indicate the routing as ordered or not ordered. 

6. A method as in claim 1, further comprising: 
maintaining a sequence number in each of the source and

destination, the sequence numbers being initialized to a 
common value; and 
including the sequence number in the request packet, 
wherein the sequence number is tracked by the source 
and destination, and wherein the NACK packet is
created if a mismatch is found between the sequence 
number in the request packet and the sequence number 
at the destination. 

7. A method for error detection and recovery in a system 
with a plurality of networked nodes, including source and
destination, communicating with each other via paths, com- 
prising: 

maintaining a sequence number in each of the source and 
destination, the sequence numbers being initialized to a 
common value;

creating a request packet for a request transaction, the 
request packet containing the sequence number; 

routing the request packet from the source to the desti- 
nation via a particular path, wherein if the request 
transaction is ordered, the sequence number at the
source is incremented; 

maintaining at each of the source and destination a status 
for each of its respective paths; 

checking the request packet for integrity and, if the 
request packet is valid, matching the sequence number
in the packet with the sequence number at the destina- 
tion; 
creating an acknowledge (ACK) packet for a response
transaction if the request packet is valid and the 
sequence number matching succeeds, and routing the 
ACK packet to the source, the ACK packet containing 
the sequence number from the request packet; 

creating a negative acknowledge (NACK) packet for the 
response transaction if the sequence number matching 
fails, and routing the NACK packet to the source, the 
NACK packet containing the sequence number from 
the request packet; 

incrementing the sequence number at the destination if the 
request transaction is ordered and the sequence number 
matching succeeds; 

detecting a time-out error for failure within a predeter- 
mined time limit to receive at the source the ACK or the 
NACK packets in response to the request packet, the 
time-out error created from a failure of the particular 
path; 

performing a barrier transaction via the particular path to 
determine if the failure of the particular path is transient 
or permanent; 

periodically repeating the barrier transaction via the par- 
ticular path in order to determine if its failure is cured; 

re-transmitting the request packet via the particular path if 
the failure is transient; and 

if the failure is permanent, updating a status at the source 
for the particular path and routing to the destination 
information about the updated status via an alternate 
path to prompt updating of the status for the particular 
path at the destination, wherein a failed path is not used. 

8. A method as in claim 7, wherein when the transaction 
is ordered the routing is deterministic preserving strict 
ordering of packets by selecting a single path as the par- 
ticular path. 

9. A method as in claim 7, wherein when the transaction 
is not ordered the routing is link set adaptive enabling 
dynamic change of packet routing by selection of paths from 
a path set. 

10. A method as in claim 7, wherein the request packet is 
discarded if it is invalid. 

11. A method as in claim 7, wherein the request packet 
contains adaptive control bits indicating whether the trans- 
action is ordered or not ordered. 

12. A system for error detection and recovery with a 
plurality of networked nodes, including source and 
destination, communicating with each other via paths, com- 
prising: 

means for creating a request packet for a request trans- 
action; path means for routing the request packet from 
the source to the destination via a particular path; 

means for maintaining at each of the source and destina- 
tion a status for each of its respective paths; 

means for detecting a time-out error for failure within a 
predetermined time limit to receive at the source an 
acknowledge (ACK) packet or a negative-acknowledge 
(NACK) packet in response to the request packet, the 
time-out error created from a failure of the particular 
path; 

means for performing a barrier transaction via the par- 
ticular path to determine if the failure of the particular 
path is transient or permanent, including means for 
periodically repeating the barrier transaction via the 
particular path in order to determine if its failure is 
cured; 

means for re-transmitting from the source the request 
packet via the particular path if the failure is transient; 
and 
if the failure is permanent,

means for updating at the source the status for the 
particular path, and 

means for routing to the destination information about 
the updated status via an alternate path to prompt
updating at the destination of the status for the
particular path, wherein a failed path is not used.

13. The system of claim 12, wherein when the transaction 
is ordered the routing is deterministic preserving strict 
ordering of request, ACK and NACK packets by selecting a
single path as the particular path. 

14. A system as in claim 12, wherein when the transaction 
is not ordered the routing is link set adaptive enabling 
dynamic change of request, ACK and NACK packet routing 
by selection of paths from a link set.

15. A system as in claim 12, further comprising: 
means for discarding the request packet if it is invalid. 

16. A system as in claim 12, wherein the request packet 
contains a sequence number to indicate its place in a 
sequence of request packets and adaptive control bits to
indicate the routing as ordered or not ordered. 

17. A system as in claim 12, further comprising: 
means for maintaining a sequence number in each of the 

source and destination, the sequence numbers being 
initialized to a common value; and 
means for including the sequence number in the request 
packet, wherein the sequence number is tracked by the 
source and destination, and wherein the NACK packet 
is created if a mismatch is found between the sequence
number in the request packet and the sequence number 
at the destination. 

18. A system for error detection and recovery in a system 
with a plurality of networked nodes, including source and 
destination, communicating with each other via paths, comprising:

means for maintaining a sequence number in each of the 

source and destination, the sequence numbers being 

initialized to a common value; 
means for creating a request packet for a request

transaction, the request packet containing the sequence 

number; 

means for routing the request packet from the source to 
the destination via a particular path, including means 
for incrementing the sequence number at the source if
the request transaction is ordered; 

means for maintaining at each of the source and destina- 
tion a status for each of its respective paths; 

means for checking the request packet for integrity, 
including means for matching the sequence number in 



the packet against the sequence number at the destina- 
tion if the request packet is valid; 

means for creating an acknowledge (ACK) packet for a 
response transaction if the request packet is valid and a 
sequence number matching succeeds, including means 
for routing the ACK packet to the source, the ACK 
packet containing the sequence number from the 
request packet; 

means for creating a negative acknowledge (NACK) 
packet for the response transaction if the sequence 
number matching fails, including means for routing the 
NACK packet to the source, the NACK packet con- 
taining the sequence number from the request packet; 

means for incrementing the sequence number if the 
request transaction is ordered and the sequence number 
matching succeeds; 

means for detecting a time-out error for failure within a 
predetermined time limit to receive at the source the 
ACK or the NACK packets in response to the request 
packet, the time-out error created from a failure of the 
particular path; 

means for performing a barrier transaction via the par- 
ticular path to determine if the failure of the particular 
path is transient or permanent, including by periodi- 
cally repeating the barrier transaction via the particular 
path in order to determine if its failure is cured; 

means for re-transmitting the request packet via the 
particular path if the failure is transient; and 

if the failure is permanent, 

means for updating a status at the source for the 

particular path, and 
means for routing to the destination information about 
the updated status via an alternate path to prompt 
updating of the status for the particular path at the 
destination, wherein a failed path is not used. 

19. A system as in claim 18, wherein when the transaction 
is ordered the routing is deterministic preserving strict 
ordering of request, ACK and NACK packets by selecting a 
single path as the particular path. 

20. A system as in claim 18, wherein when the transaction 
is not ordered the routing is link set adaptive enabling 
dynamic change of request, ACK and NACK packet routing 
by selection of paths from a path set. 

21. A system as in claim 18, further comprising: 
means for discarding the request packet if it is invalid. 

22. A system as in claim 18, wherein the request packet 
contains adaptive control bits indicating whether the trans- 
action is ordered or not ordered. 
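
As an illustration of the sequence number handling recited in claims 7 and 18, the following Python sketch shows one way the destination-side check could behave. It is a minimal sketch under stated assumptions, not the patented implementation: the 8-bit sequence width, the `Responder` class, and the `(kind, seq, commit)` return tuple are hypothetical names chosen for the example. The duplicate branch (acknowledge a resent packet but discard its data) follows the description of FIG. 9 rather than the claims.

```python
from dataclasses import dataclass

SEQ_BITS = 8                    # assumed width S of the sequence number field
SEQ_MOD = 1 << SEQ_BITS
WINDOW = 1 << (SEQ_BITS - 1)    # window size 2**(S-1), as given in the description


@dataclass
class Responder:
    expected: int = 0           # destination's local copy of the sequence number

    def on_request(self, seq, ordered, valid=True):
        """Return (kind, seq, commit) for a request packet, or None if discarded."""
        if not valid:
            return None                          # invalid packets are discarded
        if seq == self.expected:
            if ordered:                          # only ordered transactions advance the SN
                self.expected = (self.expected + 1) % SEQ_MOD
            return ("ACK", seq, True)            # reply carries the request's SN
        behind = (self.expected - seq) % SEQ_MOD
        if 0 < behind <= WINDOW:
            return ("ACK", seq, False)           # resent duplicate: acknowledge, drop data
        return ("NACK", seq, False)              # a predecessor was lost
```

For instance, a responder expecting sequence number 1 that receives a packet with sequence number 2 returns a NACK and leaves its local copy unchanged, which is the situation depicted for FIG. 8.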
ABSTRACT



The invention provides a storage system, and a method for 
operating a storage system, that provides for relatively rapid 
and reliable takeover among a plurality of independent file 
servers. Each file server maintains a reliable communication 
path to the others. Each file server maintains its own state in 
reliable memory. Each file server regularly confirms the state 
of the other file servers. Each file server labels messages on 
the redundant communication paths, so as to allow other file 
servers to combine the redundant communication paths into 
a single ordered stream of messages. Each file server main- 
tains its own state in its persistent memory and compares 
that state with the ordered stream of messages, so as to 
determine whether other file servers have progressed beyond 
the file server's own last known state. Each file server uses 
the shared resources (such as magnetic disks) themselves as 
part of the redundant communication paths, so as to prevent 
mutual attempts at takeover of resources when each file 
server believes the other to have failed. Each file server 
provides a status report to the others when recovering from 
an error, so as to prevent the possibility of multiple file 
servers each repeatedly failing and attempting to seize the 
resources of the others. 

26 Claims, 2 Drawing Sheets 
COORDINATING PERSISTENT STATUS
INFORMATION WITH MULTIPLE FILE 
SERVERS 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The invention relates to computer systems. 

2. Related Art 

Computer storage systems are used to record and retrieve 
data. It is desirable for the services and data provided by the
storage system to be available for service to the greatest 
degree possible. Accordingly, some computer storage sys- 
tems provide a plurality of file servers, with the property that 
when a first file server fails, a second file server is available 
to provide the services and the data otherwise provided by
the first. The second file server provides these services and 
data by takeover of resources otherwise managed by the first 
file server. 

One problem in the known art is that when two file servers 
each provide backup for the other, it is important that each 
of the two file servers is able to reliably detect failure of the 
other and to smoothly handle any required takeover opera- 
tions. It would be advantageous for this to occur without 
either of the two file servers interfering with proper opera- 
tion of the other. This problem is particularly acute in 
systems when one or both file servers recover from a service 
interruption. 

Accordingly, it would be advantageous to provide a 
storage system and a method for operating a storage system, 
that provides for relatively rapid and reliable takeover 
among a plurality of independent file servers. This advan- 
tage is achieved in an embodiment of the invention in which 
each file server (a) maintains redundant communication 
paths to the others, (b) maintains its own state in persistent 
memory at least some of which is accessible to the others, 
and (c) regularly confirms the state of the other file servers. 

SUMMARY OF THE INVENTION 

The invention provides a storage system and a method for 
operating a storage system, that provides for relatively rapid 
and reliable takeover among a plurality of independent file 
servers. Each file server maintains a reliable (such as 
redundant) communication path to the others, preventing 
any single point of failure in communication among file 
servers. Each file server maintains its own state in reliable 
(such as persistent) memory at least some of which is 
accessible to the others, providing a method for confirming 
that its own state information is up to date, and for recon- 
structing proper state information if not. Each file server 
regularly confirms the state of the other file servers, and 
attempts takeover operations only when the other file servers 
are clearly unable to provide their share of services. 

In a preferred embodiment, each file server sequences 
messages on the redundant communication paths, so as to 
allow other file servers to combine the redundant commu- 
nication paths into a single ordered stream of messages. 
Each file server maintains its own state in its persistent 
memory and compares that state with the ordered stream of 
messages, so as to determine whether other file servers have 
progressed beyond the file server's own last known state. 
Each file server uses the shared resources (such as magnetic 
disks) themselves as part of the redundant communication 
paths, so as to prevent mutual attempts at takeover of 
resources when each file server believes the other to have 
failed. 
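
The sequencing idea in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration under assumptions: messages are modeled as `(sequence number, payload)` pairs, and the function names `merge_paths` and `peer_has_progressed` are hypothetical, not taken from the disclosure.

```python
def merge_paths(*paths):
    """Combine messages from redundant paths into one ordered stream.

    Each path is an iterable of (seq, payload) pairs. A sequence number
    seen on more than one path is treated as a duplicate of one message.
    """
    merged = {}
    for path in paths:
        for seq, payload in path:
            merged.setdefault(seq, payload)   # keep the first copy of each message
    return [(seq, merged[seq]) for seq in sorted(merged)]


def peer_has_progressed(stream, last_known_seq):
    """True if the ordered stream contains a message newer than our saved state."""
    return bool(stream) and stream[-1][0] > last_known_seq
```

A server that has persisted sequence number 3 as its last known state and then merges a stream ending at sequence number 4 would conclude that the other server has progressed beyond that state.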
In a preferred embodiment, each file server provides a
status report to the others when recovering from an error, so 
as to prevent the possibility of multiple file servers each 
repeatedly failing and attempting to seize the resources of 
the others.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows a block diagram of a multiple file server 
system with coordinated persistent status information. 
FIG. 2 shows a state diagram of a method of operation for
a multiple file server system with coordinated persistent 
status information. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT

In the following description, a preferred embodiment of 
the invention is described with regard to preferred process 
steps and data structures. However, those skilled in the art 
would recognize, after perusal of this application, that 
embodiments of the invention may be implemented using
one or more general purpose processors (or special purpose 
processors adapted to the particular process steps and data 
structures) operating under program control, and that imple- 
mentation of the preferred process steps and data structures 
described herein using such equipment would not require
undue experimentation or further invention. 

In a preferred embodiment, the file server system, and 
each file server therein, operates using inventions described 
in the following patent applications:

application Ser. No. 09/037,652, filed Mar. 10, 1998, in 
the name of inventor Steven Kleiman, titled "Highly 
Available File Servers," attorney docket number NAP- 
012. 

Each of these applications is hereby incorporated by
reference as if fully set forth herein. They are collectively 
referred to as the "Clustering Disclosures." 

In a preferred embodiment, each file server in the file 
server system controls its associated mass storage devices so
as to form a redundant array, such as a RAID storage system,
using inventions described in the following patent applica- 
tions: 

application Ser. No. 08/471,218, filed Jun. 5, 1995, in the 
name of inventors David Hitz et al., titled "A Method 
for Providing Parity in a Raid Sub-System Using
Non-Volatile Memory", attorney docket number NET-004;

application Ser. No. 08/454,921, filed May 31, 1995, in 
the name of inventors David Hitz et al., titled "Write 
Anywhere File-System Layout", attorney docket number
NET-005;

application Ser. No. 08/464,591, filed May 31, 1995, in 
the name of inventors David Hitz et al., titled "Method 
for Allocating Files in a File System Integrated with a 
Raid Disk Sub-System", attorney docket number NET-006.

Each of these applications is hereby incorporated by 
reference as if fully set forth herein. They are collectively 
referred to as the "WAFL Disclosures." 
System Elements

FIG. 1 shows a block diagram of a multiple file server 
system with coordinated persistent status information. 

A system 100 includes a plurality of file servers 110, a 
plurality of mass storage devices 120, a SAN (system area 
network) 130, and a PN (public network) 140.

In a preferred embodiment, there are exactly two file 
servers 110. Each file server 110 is capable of acting
independently with regard to the mass storage devices 120.
Each file server 110 is disposed for receiving file server 
requests from client devices (not shown), for performing 
operations on the mass storage devices 120 in response 
thereto, and for transmitting responses to the file server 
requests to the client devices. 

For example, in a preferred embodiment, the file servers 
110 are each similar to file servers described in the Clus- 
tering Disclosures. 

Each of the file servers 110 includes a processor 111, 
program and data memory 112, and a persistent memory 113 
for maintaining state information across possible service 
interruptions. In a preferred embodiment, the persistent 
memory 113 includes a nonvolatile RAM. 

The mass storage devices 120 preferably include a plu- 
rality of writeable magnetic disks, magneto-optical disks, or 
optical disks. In a preferred embodiment, the mass storage 
devices 120 are disposed in a RAID configuration or other 
system for maintaining information persistent across pos- 
sible service interruptions. 

Each of the mass storage devices 120 is coupled to each
of the file servers 110 using a mass storage bus 121. In a
preferred embodiment, each file server 110 has its own mass 
storage bus 121. The first file server 110 is coupled to the 
mass storage devices 120 so as to be a primary controller for 
a first subset of the mass storage devices 120 and a second- 
ary controller for a second subset thereof. The second file 
server 110 is coupled to the mass storage devices 120 so as 
to be a primary controller for the second subset of the mass 
storage devices 120 and a secondary controller for the first 
subset thereof. 
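
The crossed primary/secondary arrangement described above can be modeled with a small Python sketch. The names used here (`DiskSubset`, `active_controller`, "server A", "server B") are hypothetical labels for illustration only, and the selection rule is an assumption about how a takeover decision might consult this mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskSubset:
    name: str
    primary: str      # file server that normally controls this subset
    secondary: str    # file server that controls it after a takeover


# Each server is primary for one subset and secondary for the other.
SUBSETS = [
    DiskSubset("first subset", primary="server A", secondary="server B"),
    DiskSubset("second subset", primary="server B", secondary="server A"),
]


def active_controller(subset, failed_servers=frozenset()):
    """Return the server that should control the subset, or None if both failed."""
    if subset.primary not in failed_servers:
        return subset.primary
    if subset.secondary not in failed_servers:
        return subset.secondary
    return None
```

With no failures each server controls its own subset; if "server A" is marked failed, `active_controller` returns "server B" for both subsets.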

The mass storage bus 121 associated with each file server 
110 is coupled to the processor 111 for that file server 110 so 
that file server 110 can control mass storage devices 120. In 
alternative embodiments, the file servers 110 may be 
coupled to the mass storage devices 120 using other 
techniques, such as fiber channel switches or switched 
fabrics. 

The mass storage devices 120 are disposed to include a 
plurality of mailbox disks 122, each of which has at least one 
designated region 123 into which one file server 110 can 
write messages 124 for reading by the other file server 110. 
In a preferred embodiment, there is at least one designated 
region 123 on each mailbox disk 122 for reading, and at least 
one designated region 123 for writing, by each file server 
110. 

The SAN 130 is coupled to the processor 111 and to the 
persistent memory 113 at each of the file servers 110. The 
SAN 130 is disposed to transmit messages 124 from the 
processor 111 at the first file server 110 to the persistent 
memory 113 at the second file server 110. Similarly, the 
SAN 130 is disposed to transmit messages 124 from the 
processor 111 at the second file server 110 to the persistent 
memory 113 at the first file server 110. 

In a preferred embodiment, the SAN 130 comprises a 
ServerNet connection between the two file servers 110. In 
alternative embodiments, the persistent memory 113 may be 
disposed logically remote to the file servers 110 and acces- 
sible using the SAN 130. 

The PN 140 is coupled to the processor 111 at each of the 
file servers 110. The PN 140 is disposed to transmit mes- 
sages 124 from each file server 110 to the other file server 
110. 

In a preferred embodiment, the PN 140 can comprise a 
direct communication channel, a LAN (local area network), 
a WAN (wide area network), or some combination thereof. 

Although the mass storage devices 120, the SAN 130, and 
the PN 140 are each disposed to transmit messages 124, the 
messages 124 transmitted using each of these pathways 
between the file servers 110 can have substantially differing 
formats, even though the payload for those messages 124 is 
identical. 

METHOD OF OPERATION 

FIG. 2 shows a state diagram of a method of operation for 
a multiple file server system with coordinated persistent 
status information. 

A state diagram 200 includes a plurality of states and a 
plurality of transitions therebetween. Each transition is from 
a first state to a second state and occurs upon detection of a 
selected event. 

The state diagram 200 is followed by each of the file 
servers 110 independently. Thus, there is a state for "this" 
file server 110 and another (possibly the same, possibly 
different) state for "the other" file server 110. Each file 
server 110 independently determines what transition to 
follow from each state to its own next state. The state 
diagram 200 is described herein with regard to "this" file 
server 110. 

In a NORMAL state 210, this file server 110 has control 
of its own assigned mass storage devices 120. 

In a TAKEOVER state 220, this file server 110 has taken 
over control of the mass storage devices 120 normally 
assigned to the other file server 110. 

In a STOPPED state 230, this file server 110 has control 
of none of the mass storage devices 120 and is not 
operational. 

In a REBOOTING state 240, this file server 110 has 
control of none of the mass storage devices 120 and is 
recovering from a service interruption. 
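
The four states, together with the transitions the description names for them (211, 212, 221, 231, 241, 242 and 243), can be sketched as a small lookup table. This is only an illustrative sketch: the names `State`, `TRANSITIONS` and `next_state` are hypothetical, and the table models successful transitions only.

```python
from enum import Enum


class State(Enum):
    """States of the state diagram 200; values reuse the reference numerals."""
    NORMAL = 210
    TAKEOVER = 220
    STOPPED = 230
    REBOOTING = 240


# Transition name -> (state it leaves, state it enters).
# The text names no transition into STOPPED, so none is modeled here.
TRANSITIONS = {
    "NORMAL-OPERATION": (State.NORMAL, State.NORMAL),        # 211
    "TAKEOVER-OPERATION": (State.NORMAL, State.TAKEOVER),    # 212
    "GIVEBACK-OPERATION": (State.TAKEOVER, State.NORMAL),    # 221
    "REBOOT-OPERATION": (State.STOPPED, State.REBOOTING),    # 231
    "REBOOT-FAILED": (State.REBOOTING, State.REBOOTING),     # 241
    "REBOOT-NORMAL": (State.REBOOTING, State.NORMAL),        # 242
    "REBOOT-TAKEOVER": (State.REBOOTING, State.TAKEOVER),    # 243
}


def next_state(current: State, transition: str) -> State:
    """Return the state entered by `transition`, or raise if it does not apply."""
    source, target = TRANSITIONS[transition]
    if current is not source:
        raise ValueError(f"{transition} cannot be taken from {current.name}")
    return target
```

Each file server would hold its own `State` value and apply `next_state` independently, which mirrors the statement that each file server 110 follows the state diagram 200 on its own.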
NORMAL State 

In the NORMAL state 210, both file servers 110 are 
operating properly, and each controls its set of mass storage 
devices 120. 

In this state, each file server 110 periodically sends state 
information in messages 124 using the redundant 
communication paths between the two file servers 110. Thus, each 
file server 110 periodically transmits messages 124 having 
state information by the following techniques: 

Each file server 110 transmits a message 124 by copying 
that message to the mailbox disks on its assigned mass 
storage devices 120. 

In a preferred embodiment, messages 124 are transmitted 
using the mailbox disks by writing the messages 124 to a 
first mailbox disk and then to a second mailbox disk. 

Each file server 110 transmits a message 124 by copying 
that message 124, using the SAN 130, to its persistent 
memory 113 (possibly both its own persistent memory 
113 and that for the other file server 110). 

In a preferred embodiment, messages 124 are transmitted 
using the SAN 130 using a NUMA technique. 

and 

Each file server 110 transmits a message 124 by 
transmitting that message 124, using the PN 140, to the other 
file server 110. 

In a preferred embodiment, messages 124 are transmitted 
using the PN 140 using encapsulation in a communication 
protocol known to both file servers 110, such as UDP or IP. 
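
The redundant transmission described above can be sketched as a loop that offers the same payload to every pathway and tolerates the failure of any one of them. This is a minimal sketch under stated assumptions: `send_on_all_pathways` and the three stand-in senders are hypothetical, and in-memory lists stand in for the mailbox disks, the persistent memory 113 and the PN 140.

```python
from typing import Callable, Dict, List

Pathway = Callable[[bytes], None]


def send_on_all_pathways(payload: bytes, pathways: Dict[str, Pathway]) -> List[str]:
    """Offer the payload to every pathway; return the names that accepted it."""
    delivered = []
    for name, send in pathways.items():
        try:
            send(payload)
        except OSError:
            continue  # one failed pathway must not block the others
        delivered.append(name)
    return delivered


# In-memory stand-ins for the three pathways (assumptions for illustration).
mailbox_disk_1: List[bytes] = []
mailbox_disk_2: List[bytes] = []
peer_persistent_memory: List[bytes] = []


def via_mailbox(payload: bytes) -> None:
    mailbox_disk_1.append(payload)  # first mailbox disk, then the second
    mailbox_disk_2.append(payload)


def via_san(payload: bytes) -> None:
    peer_persistent_memory.append(payload)


def via_pn(payload: bytes) -> None:
    raise OSError("network unreachable")  # simulate a failed packet network


result = send_on_all_pathways(
    b"state=NORMAL",
    {"mailbox": via_mailbox, "san": via_san, "pn": via_pn},
)
```

In this run the simulated PN failure leaves the message delivered on the mailbox and SAN pathways, which is the point of sending on more than one path.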

Each message 124 includes the following information for 
"this" file server 110 (that is, the file server 110 transmitting 
the message 124): 

a system ID for this file server 110; 
a state indicator for this file server 110; 
In a preferred embodiment, the state indicator can be one 
of the following: 

(NORMAL) operating normally, 

(TAKEOVER) this file server 110 has taken over control 
of the mass storage devices 120, 

(NO-TAKEOVER) this file server 110 does not want the 
receiving file server to take over control of its mass 
storage devices 120, and 

(DISABLE) takeover is disabled for both file servers 110. 

a generation number Gi, comprising a monotonically 
increasing number identified with a current 
instantiation of this file server 110; 

In a preferred embodiment, the instantiation of this file 
server 110 is incremented when this file server 110 is 
initiated on boot-up. If any file server 110 suffers a service 
interruption that involves reinitialization, the generation 
number Gi will be incremented, and the message 124 will 
indicate that it is subsequent to any message 124 sent before 
the service interruption. 

and 

a sequence number Si, comprising a monotonically 
increasing number identified with the current message 124 
transmitted by this file server 110. 

Similarly, each message 124 includes the following 
information for "the other" file server 110 (that is, the file server 
110 receiving the message 124): 

a generation number Gi, comprising a monotonically 
increasing number identified with a current 
instantiation of the other file server 110; 

and 

a sequence number Si, comprising a monotonically 
increasing number identified with the most recent 
message 124 received from the other file server 110. 

Each message 124 also includes a version number of the 
status protocol with which the message 124 is transmitted. 

Since the file server 110 receives the messages 124 using 
a plurality of pathways, it determines for each message 124 
whether or not that message 124 is "new" (the file server 110 
has not seen it before), or "old" (the file server 110 has seen 
it before). The file server 110 maintains a record of the 
generation number Gi and the sequence number Si of the 
most recent new message 124. The file server 110 
determines that the particular message 124 is new if and only if: 

its generation number Gi is greater than that of the most 
recent new message 124; 

or 

its generation number Gi is equal to that of the most recent 
new message 124 and its sequence number Si is greater 
than that of the most recent new message 124. 

If either of the file servers 110 determines that the 
message 124 is not new, that file server 110 can ignore that 
message 124. 

In this state, each file server 110 periodically saves its own 
state information using the messages 124. Thus, each file 
server 110 records its state information both on its own 
mailbox disks and in its own persistent memory 113. 

In this state, each file server 110 periodically watches for 
a state change in the other file server 110. The first file server 
110 detects a state change in the second file server 110 in one 
of at least two ways: 

The first file server 110 notes that the second file server 
110 has not updated its state information (using a 
message 124) for a timeout period. 

In a preferred embodiment, this timeout period is two-half 
seconds for communication using the mailbox disks and 
one-half second for communication using the SAN 130. 
However, there is no particular requirement for using these 
timeout values; in alternative embodiments, different 
timeout values or techniques other than timeout periods may be 
used. 

and 

The first file server 110 notes that the second file server 
110 has updated its state information (using one or more 
messages 124) to indicate that the second file server 110 has 
changed its state. 

In a preferred embodiment, the second file server 110 
indicates when it is in one of the states described with regard 
to each message 124. 

If the first file server 110 determines that the second file 
server 110 is also in the NORMAL state, the 
NORMAL-OPERATION transition 211 is taken to remain in the 
NORMAL state 210. 

The first file server 110 makes its determination 
responsive to messages 124 it receives from the second file server 
110. If there are no messages 124 for a time period 
responsive to the timeout period described above (such as 
two to five times the timeout period), the first file server 110 
decides that the second file server 110 has suffered a service 
interruption. 

If the first file server 110 determines that the second file 
server 110 has suffered a service interruption (that is, the 
second file server 110 is in the STOPPED state 230), the 
TAKEOVER-OPERATION transition 212 is taken to enter 
the TAKEOVER state 220. 

The TAKEOVER-OPERATION transition 212 can be 
disabled by a message 124 state indicator such as DISABLE 
or NO-TAKEOVER. 

In a preferred embodiment, either file server 110 can 
disable the TAKEOVER-OPERATION transition 212 
responsive to (a) an operator command, (b) a 
synchronization error between the persistent memories 113, or (c) any 
compatibility mismatch between the file servers 110. 

To perform the TAKEOVER-OPERATION transition 
212, this file server 110 performs the following actions at a 
step 213: 

This file server 110 sends the message 124 state indicator 
TAKEOVER to the other file server 110, using the 
reliable communication path (including the mailbox disks 
122, the SAN 130, and the PN 140). 

This file server 110 waits for the other file server 110 to 
have the opportunity to receive and act on the 
TAKEOVER-OPERATION transition 212 (that is, to 
suspend its own access to the mass storage devices 120). 

This file server 110 issues disk reservation commands to 
the mass storage devices 120 normally assigned to the 
other file server 110. 

This file server 110 takes any other appropriate action to 
assure that the other file server 110 is passive. 

If the takeover operation is successful, the 
TAKEOVER-OPERATION transition 212 completes and this file server 
enters the TAKEOVER state 220. Otherwise (such as if 
takeover is disabled), this file server 110 returns to the 
NORMAL state 210. 

TAKEOVER State 

In the TAKEOVER state 220, this file server 110 is 
operating properly, but the other file server 110 is not. This 
file server 110 has taken over control of both its and the 
other's mass storage devices 120. 

In this state, this file server 110 continues to write 
messages 124 to the persistent memory 113 and to the 
mailbox disks 122, so as to preserve its own state in the 
event of a service interruption. 
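
Returning briefly to the message 124 fields and the "new" versus "old" test described for the NORMAL state, the rule (compare the generation number first, then the sequence number) can be sketched as below. This is a minimal sketch, not the disclosed implementation: `StatusMessage`, `FreshnessFilter` and `peer_presumed_down` are hypothetical names, and the default multiple of 3.0 is an assumption chosen from the "two to five times" range given in the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StatusMessage:
    """Fields listed for each message 124 (field names are assumptions)."""
    system_id: str
    state: str             # "NORMAL", "TAKEOVER", "NO-TAKEOVER" or "DISABLE"
    generation: int        # sender's generation number
    sequence: int          # sender's sequence number
    peer_generation: int   # generation last seen from the receiver
    peer_sequence: int     # sequence last seen from the receiver
    protocol_version: int


class FreshnessFilter:
    """Remembers the most recent new message and rejects old duplicates."""

    def __init__(self) -> None:
        self.latest: Optional[Tuple[int, int]] = None

    def accept(self, msg: StatusMessage) -> bool:
        key = (msg.generation, msg.sequence)
        # Tuple comparison is lexicographic: a higher generation wins,
        # and an equal generation falls back to the sequence number.
        if self.latest is not None and key <= self.latest:
            return False
        self.latest = key
        return True


def peer_presumed_down(silence_s: float, timeout_s: float, multiple: float = 3.0) -> bool:
    """True once the peer has been silent longer than multiple * timeout."""
    return silence_s > multiple * timeout_s
```

Because the same message can arrive over the mailbox disks, the SAN 130 and the PN 140, a receiver would call `accept` once per arrival and act only on the first copy.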

In this state, this file server 110 continues to control all the 
mass storage devices 120, both its own and those normally 
assigned to the other file server 110, until this file server 110 
determines that it should give back control of some mass 
storage devices 120. 

In a preferred embodiment, the first file server 110 makes 
its determination responsive to operator control. An operator 
for this file server 110 determines that the other file server 
110 has recovered from its service interruption. The 
GIVEBACK-OPERATION transition 221 is taken to enter 
the NORMAL state 210. 

In alternative embodiments, the first file server 110 may 
make its determination responsive to messages 124 it 
receives from the second file server 110. If the second file 
server 110 sends messages 124 indicating that it has recovered 
from a service interruption (that is, it is in the REBOOTING 
state 240), the first file server 110 may initiate the 
GIVEBACK-OPERATION transition 221. 

To perform the GIVEBACK-OPERATION transition 
221, this file server 110 performs the following actions at a 
step 222: 

This file server 110 releases its disk reservation com- 
mands to the mass storage devices 120 normally 
assigned to the other file server 110. 
This file server 110 sends the message 124 state indicator 
NORMAL to the other file server 110, including by using 
the mailbox disks 122, the SAN 130, and the PN 140. 
This file server 110 disables the TAKEOVER- 
OPERATION transition 212 by the other file server 110 
until the other file server 110 enters the NORMAL state 
210. This file server 110 remains at the step 222 until 
the other file server 110 enters the NORMAL state 210. 
When the giveback operation is successful, the 
GIVEBACK-OPERATION transition 221 completes and 
this file server enters the NORMAL state 210. 
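
The action lists for the takeover (step 213) and the giveback (step 222) can be sketched together as bookkeeping over a set of reserved disks. This is only a sketch under stated assumptions: `FileServerSketch` and its attributes are hypothetical, disks are plain strings, and the waiting steps (for the other file server to suspend access, or to reach the NORMAL state) are omitted.

```python
class FileServerSketch:
    """Tracks which disks one server has reserved and what it has announced."""

    def __init__(self, own_disks, peer_disks):
        self.state = "NORMAL"
        self.own_disks = set(own_disks)
        self.peer_disks = set(peer_disks)
        self.reserved = set(own_disks)  # disks this server currently controls
        self.announced = []             # state indicators sent to the peer

    def takeover(self, peer_indicator=None):
        """Step 213: announce TAKEOVER, then reserve the peer's disks."""
        if self.state != "NORMAL":
            raise RuntimeError("takeover is only taken from the NORMAL state")
        if peer_indicator in ("DISABLE", "NO-TAKEOVER"):
            return self.state           # transition disabled; stay NORMAL
        self.announced.append("TAKEOVER")
        # The wait for the peer to suspend its disk access is omitted here.
        self.reserved |= self.peer_disks
        self.state = "TAKEOVER"
        return self.state

    def giveback(self):
        """Step 222: release the peer's disks, then announce NORMAL."""
        if self.state != "TAKEOVER":
            raise RuntimeError("giveback is only taken from the TAKEOVER state")
        self.reserved -= self.peer_disks
        self.announced.append("NORMAL")
        # Waiting until the peer itself reaches NORMAL is omitted here.
        self.state = "NORMAL"
        return self.state
```

The ordering inside each method follows the text: the state indicator is announced before reservations are taken in a takeover, and reservations are released before NORMAL is announced in a giveback.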
STOPPED State 

In the STOPPED state 230, this file server 110 has control 
of none of the mass storage devices 120 and is not opera- 
tional. 

In this state, this file server 110 performs no operations 
until this file server 110 determines that it should reboot. 

In a preferred embodiment, the first file server 110 makes 
its determination responsive to operator control. An operator 
for this file server 110 determines that it has recovered from 
its service interruption. The REBOOT-OPERATION transi- 
tion 231 is taken to enter the REBOOTING state 240. 

In alternative embodiments, the first file server 110 may 
make its determination responsive to a timer or other auto- 
matic attempt to reboot. When this file server 110 determines 
that it has recovered from its service interruption, it attempts 
to reboot, and the REBOOT-OPERATION transition 231 is 
taken to enter the REBOOTING state 240. 
REBOOTING State 

In the REBOOTING state 240, this file server 110 has 
control of none of the mass storage devices 120 and is 
recovering from a service interruption. 

In this state, the file server 110 attempts to recover from 
a service interruption. 

If this file server 110 is unable to recover from the service 
interruption, the REBOOT-FAILED transition 241 is taken 
and this file server 110 remains in the REBOOTING state 
240. 

If this file server 110 is able to recover from the service 
interruption, but the other file server 110 is in the TAKE- 
OVER state 220, the REBOOT-FAILED transition 241 is 
taken and this file server 110 remains in the REBOOTING 
state 240. In this case, the other file server 110 controls the 
mass storage devices 120 normally assigned to this file 
server 110, and this file server 110 waits for the 
GIVEBACK-OPERATION transition 221 before 
re-attempting to recover from the service interruption. 

If this file server 110 is able to recover from the service 
interruption, and determines it should enter the NORMAL 
state 210 (as described below), the REBOOT-NORMAL 
transition 242 is taken and this file server 110 enters the 
NORMAL state 210. 

If this file server 110 is able to recover from the service 
interruption, and determines it should enter the TAKEOVER 
state 220 (as described below), the REBOOT-TAKEOVER 
transition 243 is taken and this file server 110 enters the 
TAKEOVER state 220. 

In a preferred embodiment, this file server 110 performs 
the attempt to recover from the service interruption with the 
following steps. 

At a step 251, this file server 110 initiates its recovery 
operation. 

At a step 252, this file server 110 determines whether it is 
able to write to any of the mass storage devices 120 (that is, 
whether or not the other file server 110 is in the TAKEOVER 
state 220). If the other file server 110 is in the TAKEOVER 
state 220, this file server 110 displays a prompt to an operator 
so indicating and requesting the operator to command the other 
file server 110 to perform the GIVEBACK-OPERATION 
transition 221. 

This file server 110 waits until the operator commands the 
other file server 110 to perform a giveback operation, waits 
until the GIVEBACK-OPERATION transition 221 is 
complete, and proceeds with the next step. 

At a step 253, this file server 110 determines the state of 
the other file server 110. This file server 110 makes this 
determination in response to its own persistent memory 113 
and the mailbox disks 122. This file server 110 notes the 
state it was in before entering the REBOOTING state 240 
(that is, either the NORMAL state 210 or the TAKEOVER 
state 220). 

If this file server 110 determines that the other file server 
110 is in the NORMAL state 210, it proceeds with the step 
254. If this file server 110 determines that it had previously 
taken over all the mass storage devices 120 (that is, that the 
other file server 110 is in the STOPPED state 230 or the 
REBOOTING state 240), it proceeds with the step 255. 

At a step 254, this file server 110 attempts to seize its own 
mass storage devices 120 but not those normally assigned to 
the other file server 110. This file server 110 proceeds with 
the step 256. 

At a step 255, this file server 110 attempts to seize both 
its own mass storage devices 120 and those normally 
assigned to the other file server 110. This file server 110 
proceeds with the step 256. 

At a step 256, this file server 110 determines whether its 
persistent memory 113 is current with regard to pending file 
server operations. If not, this file server 110 flushes its 
persistent memory 113 of pending file server operations. 

At a step 257, this file server 110 determines if it is able 
to communicate with the other file server 110 and if there is 
anything (such as an operator command) preventing 
takeover operations. This file server 110 makes its determination 
in response to the persistent memory 113 and the mailbox 
disks 122. 

At a step 258, if this file server 110 was in the NORMAL 
state 210 before entering the REBOOTING state 240 (that 
is, this file server 110 performed the step 254 and seized only 
its own mass storage devices 120), it enters the NORMAL 
state 210. 

At a step 258, if this file server 110 was in the TAKEOVER 
state 220 before entering the REBOOTING state 240 
(that is, this file server 110 performed the step 255 and seized 
all the mass storage devices 120), it enters the TAKEOVER 
state 220. 
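
The branching in steps 252 through 258 can be sketched as a single decision function that returns the state to enter and the set of mass storage devices to seize. This is a minimal sketch under stated assumptions: `plan_recovery` is a hypothetical name, disks are plain strings, the flush of step 256 and the checks of step 257 are left out, and combinations the text does not spell out raise an error instead of guessing.

```python
from typing import Iterable, Set, Tuple


def plan_recovery(
    prior_state: str,
    peer_state: str,
    own_disks: Iterable[str],
    peer_disks: Iterable[str],
) -> Tuple[str, Set[str]]:
    """Return (state to enter, disks to seize) for a rebooting file server."""
    own, peer = set(own_disks), set(peer_disks)

    # Step 252: the peer holds every disk, so wait for a giveback.
    if peer_state == "TAKEOVER":
        return "REBOOTING", set()

    # Steps 253, 254, 258: peer is healthy, so seize only our own disks.
    if peer_state == "NORMAL":
        return "NORMAL", own

    # Steps 253, 255, 258: we had taken over before and the peer is still down.
    if prior_state == "TAKEOVER" and peer_state in ("STOPPED", "REBOOTING"):
        return "TAKEOVER", own | peer

    # Any other combination is not described in the text (assumption: refuse).
    raise ValueError(f"unspecified combination: {prior_state}, {peer_state}")
```

For example, a server that was in the TAKEOVER state 220 before its interruption and finds the other server in the STOPPED state 230 would seize all devices and re-enter TAKEOVER, whereas finding the other server in TAKEOVER leaves it in REBOOTING with no devices.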

ALTERNATIVE EMBODIMENTS 

Although preferred embodiments are disclosed herein, 
many variations are possible which remain within the 
concept, scope, and spirit of the invention, and these 
variations would become clear to those skilled in the art after 
perusal of this application. 

What is claimed is: 

1. A file server including 

a set of storage devices capable of being shared with a 
second file server; 

a controller disposed for coupling to said shared set of 
storage devices; 

a transceiver disposed for coupling to a communication 
path and for communicating messages using said 
communication path, said communication path using said 
shared set of storage devices to communicate said 
messages; 

a takeover monitor coupled to at least part of said shared 
set of storage devices, and responsive to said 
communication path and said shared set of storage devices. 

2. A file server as in claim 1, including persistent memory 
storing state information about said file server, said takeover 
monitor being responsive to said persistent memory. 

3. Apparatus including 
a shared resource; 

a pair of servers each coupled to said shared resource and 
each disposed for managing at least part of said shared 
resource; 

a communication path disposed for coupling a sequence 
of messages between said pair, said communication 
path disposed for using said shared resource for 
coupling said sequence of messages; 

each one of said pair being disposed for takeover of at 
least part of said shared resource in response to said 
communication path; 

whereby said communication path prevents both of said 
pair from concurrently performing said takeover. 

4. Apparatus as in claim 3, wherein 

at least one said server includes a file server, 
said shared resource includes a storage medium; and 
said communication path includes a designated location 
on said storage medium. 

5. Apparatus as in claim 3, wherein 

each one of said pair includes persistent memory; 

said persistent memory being disposed for storing state 
information about said pair; and 

each one of said pair being disposed for takeover in 
response to said persistent memory. 

6. Apparatus as in claim 3, wherein 

each said server is disposed for transmitting a message 
including recovery information relating to a status of 
said server on recovery from a service interruption; and 

each said server is disposed so that giveback of at least 
part of said shared resource is responsive to said 
recovery information. 

7. Apparatus as in claim 3, wherein 

each said server is disposed for transmitting a message 
including recovery information relating to a status of 
said server on recovery from a service interruption; and 

each said server is disposed so that said takeover is 
responsive to said recovery information. 

8. Apparatus as in claim 3, wherein 

said pair includes a first server and a second server; 

said first server determines a state for itself and for said 
second server in response to said communication path; 

said second server determines a state for itself and for said 
first server in response to said communication path; 

whereby said first server and said second server concur- 
rently each determine state for each other, such that it 
does not occur that each of said first server and said 
second server both consider the other to be inoperative. 

9. Apparatus as in claim 3, wherein 

said shared resource includes a plurality of storage 
devices; and 

said communication path includes at least part of said 
storage devices. 

10. Apparatus as in claim 3, wherein 

said communication path includes a plurality of indepen- 
dent communication paths between said pair, and 

each message in said sequence includes a generation 
number, said generation number being responsive to a 
service interruption and a persistent memory for a 
sender of said message. 

11. Apparatus as in claim 3, wherein 

said communication path includes a plurality of indepen- 
dent communication paths between said pair, and 

said first server is disposed for determining a state for 
itself and for said second server in response to a state 
of said shared resource and in response to a state of a 
persistent memory at said first server. 

12. Apparatus as in claim 3, wherein 

said communication path includes a plurality of indepen- 
dent communication paths between said pair, and 

said plurality of independent communication paths 
includes at least two of the group: a packet network, a 
shared storage element, a system area network. 

13. Apparatus as in claim 3, wherein 

said communication path is disposed for transmitting at 
least one message from a first said server to a second 
said server; 

said message indicating that said first server is attempting 
said takeover; 

receipt of said message being responsive to a state of said 
shared resource. 

14. Apparatus as in claim 13, wherein said second server 
is disposed for altering its state in response to said message, 
in said altered state refraining from writing to said shared 
resource. 

15. A method for operating a file server, said method 
including steps for 

controlling a subset of a set of shared storage devices; 

receiving and transmitting messages with a second file 
server, said steps for receiving and transmitting using a 
communication path including said shared storage 
devices; 

monitoring said communicating path and said shared 
storage devices; 

storing state information about said file server in a per- 
sistent memory; and 

performing a takeover operation of said shared resource in 
response to said steps for monitoring and a state of said 
persistent memory. 

16. A method including steps for 

managing at a first server at least a part of a shared 
resource; 

receiving and transmitting a sequence of messages 
between said first server and a second server, using said 
shared resource; 

performing a takeover operation at a first server of at least 
part of said shared resource in response to said 
sequence of messages; 

whereby said steps for receiving and transmitting prevent 
both of said first server and said second server from 
concurrently performing said takeover operation. 

17. A method as in claim 16, including steps for 

determining, at said first server, a state for itself and for 
said second server in response to a communication 
path; 

determining, at said second server, a state for itself and for 
said first server in response to said communication 
path; 

whereby said first server and said second server concur- 
rently each determine state for each other, such that it 
does not occur that each of said first server and said 
second server both consider the other to be inoperative. 

18. A method as in claim 16, including steps for storing 
state information about said first server in a persistent 
memory, wherein said first server determines a state for itself 
in response to a state of said persistent memory. 

19. A method as in claim 16, including steps for 

transmitting, from said first server, recovery information 
relating to a status of said first server on recovery from 
a service interruption; and 

performing a giveback operation of at least part of said 
shared resource responsive to said recovery 
information. 

20. A method as in claim 16, including steps for 

transmitting, from said first server, recovery information 
relating to a status of said server on recovery from a 
service interruption; 

wherein said steps for performing said takeover operation 
are responsive to said recovery information. 

21. A method as in claim 16, wherein 

said shared resource includes a plurality of storage 
devices; and 

a communication path includes at least part of said storage 
devices; 

whereby loss of access to said part of said storage devices 
breaks said communication path. 

22. A method as in claim 16, including steps for 

transmitting at least one message from a first said server 
to a second said server, said message indicating that 
said first server is attempting said takeover; 

altering a state of said second server in response to said 
message; and 

in said altered state refraining from writing to said shared 
resource. 

23. A method as in claim 16, wherein a communication 
path includes a plurality of independent communication 
paths between said pair; and including steps for 

numbering said sequence of messages; 

determining, at each recipient, a unified order for mes- 
sages delivered using different ones of said plurality of 
independent communication paths; and 

determining, at said first server, a state for itself and for 
said second server in response to a state of said shared 
resource and in response to a state of a persistent 
memory at said first server. 

24. A method as in claim 16, wherein a communication 
path includes a plurality of independent communication 
paths between said pair; and including steps for 

numbering said sequence of messages; 

determining, at each recipient, a unified order for mes- 
sages delivered using different ones of said plurality of 
independent communication paths; 

transmitting substantially each message in said sequence 
on at least two of said plurality of independent com- 
munication paths, whereby there is no single point of 
failure for communication between said pair. 

25. A method as in claim 16, wherein a communication 
path includes a plurality of independent communication 
paths between said pair; and including steps for 

numbering said sequence of messages; 

determining, at each recipient, a unified order for mes- 
sages delivered using different ones of said plurality of 
independent communication paths; 

wherein said plurality of independent communication 
paths includes at least two of the group: a packet 
network, a shared storage element, a system area net- 
work. 

26. A method as in claim 16, wherein a communication 
path includes a plurality of independent communication 
paths between said pair; and including steps for 

numbering said sequence of messages; 

determining, at each recipient, a unified order for mes- 
sages delivered using different ones of said plurality of 
independent communication paths; 

wherein said steps for numbering include (a) determining 
a generation number in response to a service interrup- 
tion and a persistent memory for a sender of said 
message, and (b) providing said generation number in 
substantially each message in said sequence. 